{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supervised Learning Algorithms: Linear vs Polynomial Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*In this template, only **data input** and **input/target variables** need to be specified (see \"Data Input & Variables\" section for further instructions). None of the other sections needs to be adjusted. As a data input example, .csv file from IBM Box web repository is used.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Run to import the required libraries.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib notebook\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Input and Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Data Input\n",
    "# df = \n",
    "\n",
    "### Defining Variables  \n",
    "# X = \n",
    "# y = \n",
    "\n",
    "### Data Input Example \n",
    "df = pd.read_csv('https://ibm.box.com/shared/static/q6iiqb1pd7wo8r3q28jvgsrprzezjqk3.csv')\n",
    "\n",
    "X = df[['horsepower']]\n",
    "y = df['price']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. The Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1. Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Run to build the Linear Regression model.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "linear model coeff (w): [157.09522969]\n",
      "linear model intercept (b): -3574.121\n",
      "R-squared score (training): 0.623\n",
      "R-squared score (test): 0.666\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                   random_state = 0)\n",
    "\n",
    "# Linear regression def\n",
    "linreg = LinearRegression().fit(X_train, y_train)\n",
    "\n",
    "### intercept & coefficient, R-squared for training & test data set\n",
    "print('linear model coeff (w): {}'\n",
    "     .format(linreg.coef_))\n",
    "print('linear model intercept (b): {:.3f}'\n",
    "     .format(linreg.intercept_))\n",
    "print('R-squared score (training): {:.3f}'\n",
    "     .format(linreg.score(X_train, y_train)))\n",
    "print('R-squared score (test): {:.3f}'\n",
    "     .format(linreg.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2. Polynomial Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Run to build the Polynomial Regression model.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(poly deg 2) linear model coeff (w):\n",
      "[0.00000000e+00 1.40923904e+02 6.48999886e-02]\n",
      "(poly deg 2) linear model intercept (b): -2683.607\n",
      "(poly deg 2) R-squared score (training): 0.623\n",
      "(poly deg 2) R-squared score (test): 0.670\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "'''\n",
    "Now we transform the original input data to add\n",
    "polynomial features up to degree 2\n",
    "\n",
    "'''\n",
    "\n",
    "poly = PolynomialFeatures(degree=2)\n",
    "X_poly = poly.fit_transform(X)\n",
    "\n",
    "# train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_poly, y,\n",
    "                                                   random_state = 0)\n",
    "# Polynomial regression def\n",
    "linreg = LinearRegression().fit(X_train, y_train)\n",
    "\n",
    "### intercept & coefficient, R-squared for training & test data set\n",
    "print('(poly deg 2) linear model coeff (w):\\n{}'\n",
    "     .format(linreg.coef_))\n",
    "print('(poly deg 2) linear model intercept (b): {:.3f}'\n",
    "     .format(linreg.intercept_))\n",
    "print('(poly deg 2) R-squared score (training): {:.3f}'\n",
    "     .format(linreg.score(X_train, y_train)))\n",
    "print('(poly deg 2) R-squared score (test): {:.3f}\\n'\n",
    "     .format(linreg.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3. Polynomial Regression with Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run to build the Polynomial Regression model with a regularization penalty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(poly deg 2 + ridge) linear model coeff (w):\n",
      "[0.00000000e+00 1.40908895e+02 6.49573697e-02]\n",
      "(poly deg 2 + ridge) linear model intercept (b): -2682.747\n",
      "(poly deg 2 + ridge) R-squared score (training): 0.623\n",
      "(poly deg 2 + ridge) R-squared score (test): 0.670\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "'''\n",
    "Addition of many polynomial features often leads to\n",
    "overfitting, so we often use polynomial features in combination\n",
    "with regression that has a regularization penalty, like ridge\n",
    "regression.\n",
    "'''\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_poly, y,\n",
    "                                                   random_state = 0)\n",
    "linreg = Ridge().fit(X_train, y_train)\n",
    "\n",
    "### intercept & coefficient, R-squared for training & test data set\n",
    "print('(poly deg 2 + ridge) linear model coeff (w):\\n{}'\n",
    "     .format(linreg.coef_))\n",
    "print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'\n",
    "     .format(linreg.intercept_))\n",
    "print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'\n",
    "     .format(linreg.score(X_train, y_train)))\n",
    "print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'\n",
    "     .format(linreg.score(X_test, y_test)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
